{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Principal Component Analysis (PCA)\n",
    "This notebook explains the basic concepts of Principal Component Analysis (PCA).\n",
    "\n",
    "Farhad Kamangar\n",
    "Feb. 10, 2017"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is PCA?\n",
    "The main goal of Principal Components Analysis (PCA) is  to transform a set of multi-dimensional data points from their original space to another space such that the correlation between the variables in the transform space is minimized.\n",
    "Once the data is projected into the new space, some of the dimensions which have low variance can be ignored and the original data may be presented in a lower dimensional space without too much loss of information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Presentation\n",
    "\n",
    "Suppose we have a set of $N$ data points in a $D$ dimensional space. This means that each data point is a point in a $D$ dimensional space and the data set can be presented as an $N$ by $D$ matrix.\n",
    "\n",
    "For example assume that we measure the height (inches), weight (pounds), and their waist size (inches) of 5 people and present it as a data set:\n",
    "$$\\large \\left( {\\matrix{ 65 & 72 & 68 & 80 & 66 \\cr  150 & 180 & 156 & 190 & 152  \\cr   30 & 34 & 32 & 38 & 36  \\cr  } } \\right)$$\n",
    "\n",
    "Notes:\n",
    "* Each column represents one person ( one data point).\n",
    "* Each row represents an attribute, such as height (a dimension)\n",
    "* Each person can be represented a single point, $X$, in a 3-dimensional space.\n",
    "\n",
    "$$\\large X= \\left[ {\\matrix{ x_1 \\cr  x_2  \\cr   x_3  \\cr  } } \\right]$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variance and Covariance\n",
    "Variance of an attribute (a dimension) is defined as: \n",
    "\n",
    "$$\\large {\\mathop{\\rm var}} ({x_i}) = {{\\sum\\limits_{k = 1}^N {{{({x_{i,k}} - \\mu_i)}^2}} } \\over {(N - 1)}}$$\n",
    "\n",
    "where $x_{i,k}$ is the value of the $i_{th}$ attribute in sample $k$, and $\\mu_i$ is the expected value of the $i_{th}$ attribute.\n",
    "\n",
    "Covariance between two attributes (two dimensions) is defined as:\n",
    "\n",
    "$$\\large {\\mathop{\\rm cov}} ({x_i},{x_j}) = {{\\sum\\limits_{k = 1}^N {{{({x_{i,k}} - \\mu_i)({x_{j,k}} - \\mu_j)}}} } \\over {(N - 1)}}$$\n",
    "\n",
    "Notes:\n",
    "* If covariance between two attributes is positive, then those dimensions have the tendency to increase together.\n",
    "* If covariance between two attributes is negative, then those dimensions have the tendency to increase or decrease opposite of each other (one increases, the other decreases)\n",
    "* If covariance between two attributes is zero, then the two attributes are independent of each other.  \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is a Covariance Matrix?\n",
    "Covariance matrix shows the variance and the correlation between the dimensions in a multi-dimensional data.\n",
    "Suppose we have a set of $N$ data points in a $D$ dimensional space. This means that each data point is a point in a $D$ dimensional space and the data set can be presented as an $N$ by $D$ matrix.\n",
    " \n",
    "The covariance of the data set can be calculated as:\n",
    "\n",
    "\n",
    "$$\\large \\Sigma = \\left( {\\matrix{   {E{{({x_1} - {\\mu _1})}^2}} & {...} & {E\\left( {({x_1} - {\\mu _1})({x_d} - {\\mu _d})} \\right)}  \\cr     \\vdots  &  \\cdots  &  \\vdots   \\cr    {E\\left( {({x_d} - {\\mu _d})(x{}_1 - {\\mu _1})} \\right)} &  \\ldots  & {E{{({x_d} - {\\mu _d})}^2}}  \\cr  } } \\right)$$\n",
    "\n",
    "\n",
    "where $E$ is the expected value, $x_i$ represents the $i_th$ dimension and ${\\mu _i}$ is the average value for the $i_th$ dimension. \n",
    "\n",
    "The covariance matrix can also be formulated in the matrix form as:\n",
    "\n",
    "$$\\large \\Sigma={1 \\over N}\\sum\\limits_{i = 1}^N {({{\\bf{X}}} - {\\bf{ \\mu}}){{({\\bf{X}} - {\\bf{ \\mu}})}^T}} $$\n",
    "\n",
    "Let's calculate the covariance of the above example, i.e., height, weight, and waist of 5 persons:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using matplotlib backend: Qt4Agg\n",
      "Using matplotlib backend: Qt4Agg\n"
     ]
    }
   ],
   "source": [
    "# imports\n",
    "try:\n",
    "    if __IPYTHON__:\n",
    "        from IPython import get_ipython\n",
    "\n",
    "        get_ipython().magic('matplotlib')\n",
    "        from ipython_utilities import *\n",
    "        from ipywidgets import interact, interactive, fixed, \\\n",
    "            FloatSlider, IntSlider, FloatRangeSlider, Label\n",
    "        from IPython.display import display, HTML\n",
    "        in_ipython_flag = True\n",
    "except:\n",
    "    in_ipython_flag = False\n",
    "import cv2 as cv\n",
    "from matplotlib import pyplot as plt\n",
    "import numpy as np\n",
    "from threading import Thread"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "Original Data Set<table><tr><td>65.00</td><td>72.00</td><td>68.00</td><td>80.00</td><td>66.00</td></tr><tr><td>150.00</td><td>180.00</td><td>156.00</td><td>190.00</td><td>152.00</td></tr><tr><td>30.00</td><td>34.00</td><td>32.00</td><td>38.00</td><td>36.00</td></tr></table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Mean<table><tr><td>70.20</td></tr><tr><td>165.60</td></tr><tr><td>34.00</td></tr></table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Zero mean data set<table><tr><td>-5.20</td><td>1.80</td><td>-2.20</td><td>9.80</td><td>-4.20</td></tr><tr><td>-15.60</td><td>14.40</td><td>-9.60</td><td>24.40</td><td>-13.60</td></tr><tr><td>-4.00</td><td>0.00</td><td>-2.00</td><td>4.00</td><td>2.00</td></tr></table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Covariance Matrix<table><tr><td>37.20</td><td>106.10</td><td>14.00</td></tr><tr><td>106.10</td><td>330.80</td><td>38.00</td></tr><tr><td>14.00</td><td>38.00</td><td>10.00</td></tr></table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "data=np.array([[65,72,68,80,66],[150,180,156,190,152],[30,34,32,38,36]])\n",
    "mean=np.mean(data,1) # Mean for each dimension\n",
    "zero_mean_data=data-mean[:, np.newaxis]\n",
    "covariance_matrix=np.cov(data)\n",
    "display_as_html_table(data, \"Original Data Set\")\n",
    "mean=mean[np.newaxis, :].T  \n",
    "display_as_html_table(mean, \"Mean\")\n",
    "display_as_html_table(zero_mean_data, \"Zero mean data set\")\n",
    "display_as_html_table(covariance_matrix, \"Covariance Matrix\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from numpy.random import standard_normal\n",
    "from matplotlib.patches import Ellipse\n",
    "from numpy.linalg import svd\n",
    "\n",
    "\n",
    "def plot_2d_pca(mean_x,\n",
    "                mean_y,\n",
    "                sigma_x,\n",
    "                sigma_y,\n",
    "                rotation_angle,\n",
    "                center=True):\n",
    "    rotation_angle=np.pi*rotation_angle/180.\n",
    "    mean = np.array([mean_x, mean_y])\n",
    "    sigma = np.array([sigma_x, sigma_y])\n",
    "    rotation_matrix = np.array([[np.cos(rotation_angle), -np.sin(rotation_angle)],\n",
    "                                [np.sin(rotation_angle), np.cos(rotation_angle)]])\n",
    "    data_set = np.dot(standard_normal((1000, 2)) * sigma[np.newaxis, :], rotation_matrix.T) + mean[np.newaxis, :]\n",
    "\n",
    "    figure_labels = plt.get_figlabels()\n",
    "    if figure_labels or figure_labels != \"PCA Demo\":\n",
    "        fig = plt.figure(\"PCA Demo\", figsize=(8, 8))\n",
    "        ax = fig.add_subplot(111)\n",
    "    else:\n",
    "        plt.figure(\"PCA Demo\")\n",
    "    ax.clear()\n",
    "    ax.scatter(data_set[:200, 0], data_set[:200, 1], marker='*')\n",
    "    ax.grid()\n",
    "    limit = 10.0\n",
    "    ax.set_xlim([-limit, limit])\n",
    "    ax.set_ylim([-limit, limit])\n",
    "#     ellipse = Ellipse(xy=np.array([mean_x, mean_y]), width=sigma_x * 3, height=sigma_y * 3, angle=rotation_angle / np.pi * 180,\n",
    "#                 facecolor=[1.0, 0, 0], alpha=0.3)\n",
    "#     ax.add_artist(ellipse)\n",
    "    if center:\n",
    "        X_mean = data_set.mean(axis=0, keepdims=True)\n",
    "    else:\n",
    "        X_mean = np.zeros((1, 2))\n",
    "    U, s, V = svd(data_set - X_mean, full_matrices=False)\n",
    "    for v in np.dot(np.diag(s / np.sqrt(data_set.shape[0])), V):  # Each eigenvector\n",
    "        ax.arrow(X_mean[0, 0], X_mean[0, 1], -v[0], -v[1], width=0.02,\n",
    "                 head_width=0.1, head_length=0.1, fc='r', ec='b')\n",
    "    plt.show()\n",
    "\n",
    "\n",
    "controls = interactive(plot_2d_pca,\n",
    "                       mean_x=FloatSlider(min=-10.0, max=10.0, value=0),\n",
    "                       mean_y=FloatSlider(min=-10.0, max=10.0, value=0),\n",
    "                       sigma_x=FloatSlider(min=0.1, max=4, value=1.0),\n",
    "                       sigma_y=FloatSlider(min=0.1, max=4, value=0.5),\n",
    "                       rotation_angle=FloatSlider(min=0.0, max=180, value=30.0),\n",
    "                       center=True);\n",
    "arrange_widgets_in_grid(controls)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  },
  "widgets": {
   "state": {
    "0041d296876c4fa7b093f9761f422e52": {
     "views": [
      {
       "cell_index": 7
      }
     ]
    }
   },
   "version": "1.2.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
